{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第一章 初见网络爬虫"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 网络连接"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "输出在域名`http://pythonscraping.com`的服务器上`网络应用根地址/pages文件夹`里的HTML文件`page1.html`的源代码\n",
    "\n",
    "python程序只能读取已经请求的单个HTML文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "b'<html>\\n<head>\\n<title>A Useful Page</title>\\n</head>\\n<body>\\n<h1>An Interesting Title</h1>\\n<div>\\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\\n</div>\\n</body>\\n</html>\\n'\n"
     ]
    }
   ],
   "source": [
    "from urllib.request import urlopen\n",
    "\n",
    "html = urlopen('http://pythonscraping.com/pages/page1.html')\n",
    "print(html.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## BeautifulSoup简介"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<h1>An Interesting Title</h1>\n",
      "<h1>An Interesting Title</h1>\n"
     ]
    }
   ],
   "source": [
    "from urllib.request import urlopen\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "html = urlopen('http://www.pythonscraping.com/pages/page1.html')\n",
    "bs = BeautifulSoup(html.read(), 'html.parser')\n",
    "print(bs.html.body.h1)\n",
    "print(bs.h1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 可靠的网络连接"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "思考代码的总体布局，让代码既可以捕捉异常又容易阅读\n",
    "\n",
    "函数（具有周密的异常处理功能）让快速稳定地网络数据采集变得简单易行"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The server could not be found!\n"
     ]
    }
   ],
   "source": [
    "from urllib.request import urlopen\n",
    "from urllib.error import HTTPError\n",
    "from urllib.error import URLError\n",
    "\n",
    "try:\n",
    "    html = urlopen(\"https://pythonscrapingthisurldoesnotexist.com\")\n",
    "except HTTPError as e:\n",
    "    print(\"The server returned an HTTP error\")\n",
    "except URLError as e:\n",
    "    print(\"The server could not be found!\")\n",
    "else:\n",
    "    print(html.read())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<h1>An Interesting Title</h1>\n"
     ]
    }
   ],
   "source": [
    "from urllib.request import urlopen\n",
    "from urllib.error import HTTPError\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "\n",
    "def getTitle(url):\n",
    "    try:\n",
    "        html = urlopen(url)\n",
    "    except HTTPError as e:\n",
    "        return None\n",
    "    try:\n",
    "        bsObj = BeautifulSoup(html.read(), \"lxml\")\n",
    "        title = bsObj.body.h1\n",
    "    except AttributeError as e:\n",
    "        return None\n",
    "    return title\n",
    "\n",
    "\n",
    "title = getTitle(\"http://www.pythonscraping.com/pages/page1.html\")\n",
    "if title == None:\n",
    "    print(\"Title could not be found\")\n",
    "else:\n",
    "    print(title)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "165px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
